Basics of probability and distributions
Random variables & probability
- Probability is the expression of belief in some future outcome
- A random variable can take on different values with different probabilities
- The sample space of a random variable is the universe of all possible values
Random variables & probability
- The sample space can be represented by a
  - probability distribution (for discrete variables)
  - probability density function (PDF, for continuous variables)
  - algebra and calculus are used for each, respectively
- The probabilities across an entire sample space always sum to 1.0
- There are many families or forms of distributions and PDFs
  - the family depends on the nature of the dynamical system it represents
  - the exact instantiation of a family depends on its parameter values
  - in statistics we are often interested in estimating these parameters
Bernoulli distribution
\[Pr(X=\text{Head}) = \frac{1}{2} = 0.5 = p \]
\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p \]
Bernoulli distribution
- If the coin isn’t fair then \(p \neq 0.5\)
- However, the probabilities still sum to 1
\[ p + (1-p) = 1 \]
- Same is true for other binary possibilities
- success or failure
- yes or no answers
- choosing an allele from a population based upon allele frequencies
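A minimal R sketch of Bernoulli trials (the value of p and the seed are arbitrary choices for illustration):

```r
# Bernoulli trials: single coin flips with Pr(heads) = p
p <- 0.7                                       # an example unfair coin
set.seed(1)
flips <- rbinom(n = 1000, size = 1, prob = p)  # 1 = heads, 0 = tails

# the probabilities of the two outcomes sum to 1
p + (1 - p)    # 1
mean(flips)    # sample estimate of p, close to 0.7
```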
Probability rules
- Flip a coin twice
- Represent the first flip as ‘X’ and the second flip as ‘Y’
- First, suppose you specify the exact ordered outcome of both flips in advance
\[ Pr(\text{X=H and Y=H}) = p*p = p^2 \] \[ Pr(\text{X=H and Y=T}) = p*(1-p) \] \[ Pr(\text{X=T and Y=H}) = (1-p)*p \] \[ Pr(\text{X=T and Y=T}) = (1-p)*(1-p) = (1-p)^2 \]
Probability rules
- Now determine the probability when the H and T can occur in either order
\[ \text{Pr(one H and one T) =} \] \[ \text{Pr(X=H and Y=T) or Pr(X=T and Y=H)} = \] \[ p*(1-p) + (1-p)*p = 2p(1-p) \]
- These are the ‘and’ and ‘or’ rules of probability
- ‘and’ means multiply the probabilities
- ‘or’ means sum the probabilities
- most probability distributions can be built up from these simple rules
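The 'and' and 'or' rules above can be checked directly in R (p = 0.6 is an arbitrary example value):

```r
# 'and' and 'or' rules for two coin flips with Pr(heads) = p
p <- 0.6
pr_HH <- p * p                # 'and': multiply probabilities
pr_HT <- p * (1 - p)
pr_TH <- (1 - p) * p
pr_TT <- (1 - p) * (1 - p)

# 'or': sum the probabilities of mutually exclusive outcomes
pr_one_of_each <- pr_HT + pr_TH   # equals 2 * p * (1 - p)

# the four outcomes cover the whole sample space
pr_HH + pr_HT + pr_TH + pr_TT     # 1
```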
Joint and conditional probability
Joint probability
\[Pr(X,Y) = Pr(X) * Pr(Y)\]
- Note that this is true for two independent events
- However, for two non-independent events we also have to take into account their covariance
Joint and conditional probability
Conditional probability
- For two independent variables
\[Pr(Y|X) = Pr(Y)\text{ and }Pr(X|Y) = Pr(X)\]
- For two non-independent variables
\[Pr(Y|X) \neq Pr(Y)\text{ and }Pr(X|Y) \neq Pr(X)\]
- Variables that are non-independent have a shared variance, which is also known as the covariance
- Covariance standardized to a mean of zero and a unit standard deviation is correlation
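A small R sketch of the covariance–correlation relationship (the simulated variables and coefficients are arbitrary):

```r
# covariance and correlation between two non-independent variables
set.seed(2)
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000)   # y shares variance with x

cov(x, y)                    # the shared variance (covariance)
cor(x, y)                    # covariance standardized by both SDs
# correlation is covariance divided by the two standard deviations
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))   # TRUE
```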
Binomial Distribution
- The distribution of probabilities for each combination of outcomes is
\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
n is the total number of trials
k is the number of successes
p is the probability of success
q is the probability of failure
- For the binomial, as with the Bernoulli, \(p = 1 - q\)
Binomial Probability Distribution

Binomial Probability Distribution
- Note that the binomial function incorporates both the ‘and’ and ‘or’ rules of probability
- This part is the probability of each outcome (multiplication)
\[\large p^{k} (1-p)^{n-k}\]
- This part (called the binomial coefficient) is the number of different ways each combination of outcomes can be achieved (summation)
\[\large {n \choose k}\]
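The binomial formula can be assembled from its two parts and compared against R's built-in `dbinom()` (the values of n, k, and p are arbitrary examples):

```r
# binomial probability built from the formula vs. R's dbinom()
n <- 10; k <- 3; p <- 0.4
manual <- choose(n, k) * p^k * (1 - p)^(n - k)  # coefficient * outcome probability
manual
dbinom(k, size = n, prob = p)                   # same value
```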
Poisson Probability Distribution
Poisson Probability Distribution
- For example, you can examine 1000 genes
- count the number of base pairs in the coding region of each gene
- what is the probability of observing a gene with ‘r’ bp in it?
\(Pr(Y=r)\) is the probability that the number of occurrences of the event \(Y\) equals the count \(r\) across the total number of trials
\[Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}\]
Poisson Probability Distribution
- Note that this is a single parameter function because \(\mu = \sigma^2\)
- The two together are often just represented by \(\lambda\)
\[Pr(Y=r) = \frac{e^{-\lambda}\lambda^r}{r!}\]
- This means that for a variable that is truly Poisson distributed:
- the mean and variance should be equal to one another
- variables that are approximately Poisson distributed but have a larger variance than mean are often called ‘overdispersed’
- quite common in RNA-seq and microbiome data
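A quick R check of both points: the Poisson formula matches `dpois()`, and simulated Poisson draws have roughly equal mean and variance (lambda, r, and the seed are arbitrary example values):

```r
# Poisson: mean and variance share one parameter, lambda
lambda <- 4
r <- 2
# probability of exactly r events, from the formula and from dpois()
manual <- exp(-lambda) * lambda^r / factorial(r)
manual
dpois(r, lambda)                # same value

# simulated Poisson draws: mean and variance should be close
set.seed(3)
y <- rpois(10000, lambda)
mean(y); var(y)                 # both near lambda = 4
```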
Poisson Probability Distribution | gene length by bins of 500 nucleotides

Poisson Probability Distribution | increasing parameter values of \(\lambda\)

Log-normal PDF | Continuous version of Poisson (-ish)

Binomial to Normal | Categorical to continuous

The Normal (aka Gaussian) | Probability Density Function (PDF)

Normal PDF

Normal PDF | A function of two parameters
(\(\mu\) and \(\sigma\))
\[\large f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
where \[\large \pi \approx 3.14159\]
\[\large e \approx 2.71828\]
To write that a variable (v) is distributed as a normal distribution with mean \(\mu\) and variance \(\sigma^2\), we write the following:
\[\large v \sim \mathcal{N} (\mu,\sigma^2)\]
Normal PDF | estimates of mean and variance
Estimate of the mean from a single sample
\[\Large \bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i} \]
Estimate of the variance from a single sample
\[\Large s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2} \]
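These two estimators can be written out explicitly in R and compared with the built-in `mean()` and `var()` (the simulated sample is an arbitrary example):

```r
# sample estimates of the Normal parameters
set.seed(4)
x <- rnorm(1000, mean = 10, sd = 2)     # draws from N(10, 4)

xbar <- sum(x) / length(x)                       # (1/n) * sum of x_i
s2   <- sum((x - xbar)^2) / (length(x) - 1)      # sum of squared deviations / (n - 1)

all.equal(xbar, mean(x))   # TRUE
all.equal(s2, var(x))      # TRUE
```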
Normal PDF

Why is the Normal special in biology?

Why is the Normal special in biology?

Why is the Normal special in biology?

Parent-offspring resemblance

Genetic model of complex traits

Distribution of \(F_2\) genotypes | really just binomial sampling

Why else is the Normal special?
- The normal distribution is immensely useful because of the central limit theorem
- The mean of many random variables independently drawn from the same distribution is distributed approximately normally
- One can think of numerous situations, such as
- when multiple genes contribute to a phenotype
- or that many factors contribute to a biological process
- In addition, whenever there is variance introduced by stochastic factors or sampling error, the central limit theorem holds
- Thus, normal distributions occur throughout biology and biostatistics
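A minimal R sketch of the central limit theorem: sample means of a strongly skewed (exponential) variable come out approximately normal (the sample sizes and seed are arbitrary choices):

```r
# central limit theorem: means of skewed draws look approximately normal
set.seed(5)
means <- replicate(5000, mean(rexp(30, rate = 1)))  # 5000 sample means, n = 30 each

mean(means)   # close to the exponential mean, 1
sd(means)     # close to 1 / sqrt(30), the theoretical SD of the mean
hist(means)   # roughly bell-shaped despite the skewed parent distribution
```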
z-scores of normal variables
- Often we want to make variables more comparable to one another
- For example, consider measuring the leg length of mice and of elephants
- Which animal has longer legs in absolute terms?
- Which has longer legs on average proportional to their body size?
- Which has more variation proportional to their body size?
- A good way to answer this last question is to use ‘z-scores’
z-scores of normal variables
- z-scores are standardized to a mean of 0 and a standard deviation of 1
- We can modify any normal distribution to have a mean of 0 and a standard deviation of 1
- Another term for this is the standard normal distribution
\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]
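The z-score formula above is what R's `scale()` computes; a small sketch with simulated leg lengths (the numbers are arbitrary examples):

```r
# z-scores: standardize a variable to mean 0 and SD 1
set.seed(6)
leg_length <- rnorm(50, mean = 30, sd = 4)
z <- (leg_length - mean(leg_length)) / sd(leg_length)

mean(z)   # ~0
sd(z)     # 1
# scale() performs the same standardization
all.equal(z, as.numeric(scale(leg_length)))   # TRUE
```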
R Interlude | Complete Exercise 3.1
Linear Models and Regression
Parent offspring regression

Linear Models - a note on history

Linear Models - a note on history

Bivariate normality

Covariance and correlation

A linear model to relate two variables

Parameter Estimation | Ordinary Least Squares (OLS)
- Algorithmic approach to parameter estimation
- One of the oldest and best developed statistical approaches
- Used extensively in linear models (ANOVA and regression)
- By itself only produces a single best estimate (No C.I.’s)
- Many OLS estimators are reproduced exactly by maximum likelihood (ML) estimators
Many approaches are linear models
- Flexible - applicable to many different study designs
- Provides a common set of tools (lm in R for fixed effects)
- Includes tools to estimate parameters:
- sizes of effects like the slope
- difference in means among categories
- Is easier to work with, especially with multiple variables
Many approaches are linear models
- Linear regression
- Single factor ANOVA
- Analysis of covariance (ANCOVA)
- Multiple regression
- Multi-factor ANOVA
- Repeated-measures ANOVA
Plethora of linear models
General Linear Model (GLM) - two or more continuous variables
General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables
Generalized Linear Model - a GLM that doesn’t assume normality of the response
Generalized Additive Model (GAM) - a model that doesn’t assume linearity
Linear models
All can be written in the form
response variable = intercept + (explanatory_variables) + random_error
in the general form:
\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + \epsilon\]
where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model
linear model parameters

linear models in R
- Need to fit the model and then ‘read’ the output
- In general you will ask R to fit linear models and then do additional analyses on the output
Model fitting and hypothesis tests in regression
\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]
full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)
reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)
- fits a “reduced” model without slope term (\(H_0\))
- fits the “full” model with slope term added back
- compares fit of full and reduced models using an F test
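The full-versus-reduced comparison can be sketched with `lm()` and `anova()` in R (the simulated data and coefficients are arbitrary examples):

```r
# full vs. reduced model comparison with an F test
set.seed(7)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)   # true beta_0 = 2, beta_1 = 3

reduced <- lm(y ~ 1)          # intercept only (slope fixed at 0)
full    <- lm(y ~ x)          # intercept + slope

anova(reduced, full)          # F test of H0: beta_1 = 0
coef(full)                    # estimates of beta_0 and beta_1
```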
Model fitting and hypothesis tests in regression

Hypothesis tests in linear regression

Hypothesis tests in linear regression

Relationship of correlation and regression
\[\beta_{YX}=\rho_{YX}*\sigma_Y/\sigma_X\] \[b_{YX} = r_{YX}*S_Y/S_X\]
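This identity between the regression slope and the correlation holds exactly for sample estimates, which a short R check confirms (the simulated data are an arbitrary example):

```r
# slope from correlation: b_YX = r_YX * s_Y / s_X
set.seed(8)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)

b_from_cor <- cor(x, y) * sd(y) / sd(x)      # slope via the identity
b_from_lm  <- unname(coef(lm(y ~ x))[2])     # slope via OLS
all.equal(b_from_cor, b_from_lm)             # TRUE
```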
Residual Analysis | did we meet our assumptions?
- Independent errors (residuals)
- Equal variance of residuals in all groups
- Normally-distributed residuals
- Robustness to departures from these assumptions is improved when sample size is large and design is balanced
Residual Analysis | did we meet our assumptions?
\[y_i = \beta_0 + \beta_1 * x_i + \epsilon_i\]
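A minimal residual-check sketch in R (the simulated data are an arbitrary example; the plots are the usual visual diagnostics):

```r
# residual checks for a fitted linear model
set.seed(9)
x <- runif(100, 0, 10)
y <- 2 + 0.8 * x + rnorm(100)
fit <- lm(y ~ x)

res <- residuals(fit)
mean(res)                 # ~0 by construction with OLS and an intercept
plot(fitted(fit), res)    # look for equal spread and no pattern
qqnorm(res); qqline(res)  # check normality of the residuals
```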
Residual Analysis

Residual Analysis

Residual Plots | Spotting assumption violations

Anscombe’s quartet | what would residual plots look like for these?

Anscombe’s quartet | what would residual plots look like for these?
